Abstract. Familial Searching is the process of searching in a DNA database
for relatives of a certain individual. It is well known that in order to evaluate 
the genetic evidence in favour of a certain given form of relatedness between 
two individuals, one needs to calculate the appropriate likelihood ratio, which 
is in this context called a Kinship Index. Suppose that the database contains, 
for a given type of relative, at most one related individual. Given prior prob- 
abilities for being the relative for all persons in the database, we derive the 
likelihood ratio for each database member in favour of being that relative. This 
likelihood ratio takes all the Kinship Indices between the target individual and 
the members of the database into account. We also compute the corresponding 
posterior probabilities. We then discuss two methods to select a subset from 
the database that contains the relative with a known probability, or at least a 
useful lower bound thereof. One method needs prior probabilities and yields 
posterior probabilities, the other does not. We discuss the relation between 
the approaches, and illustrate the methods with familial searching carried out 
in the Dutch National DNA Database. 



1. Introduction 

Many countries maintain databases that contain forensic DNA profiles of traces 
and of certain known individuals, e.g. convicted offenders or suspects of certain 
crimes. These databases were originally set up to directly identify an unknown 
offender by looking for matching DNA profiles. However, since DNA is inherited
from parent to child, it is also possible to use them to look for the offender's relatives, 
rather than the offender himself, if the offender's DNA profile turns out not to be 
in the database. This last process is called familial (DNA) searching, and is carried 
out in several jurisdictions (e.g. the UK, some US states and New Zealand). As a 
result there have been some high profile successes (see e.g. [9] for the Grim Sleeper 
case). The Netherlands have recently adopted a law that allows familial searching 
in some cases. 

Previous studies on familial DNA searching have mostly concentrated on empiri- 
cal determination of the rank that a relative (of some target profile) in the database 
occupies, when the database is ordered according to decreasing likelihood ratio with 
the target, or according to decreasing number of shared alleles with the target. See 
e.g. [1] for simulations that also include a geographical component (based on US
states), or [6], [10] and [8]. In [7], false exclusion rates and false inclusion rates 
for various thresholds on the likelihood ratio and/or number of shared alleles are 
estimated by simulation. These rates are averages over different target profiles. 
The contributions of this paper to the research on familial searching are as follows.
In part, we rediscover and collect several different statistical aspects of familial
searching that are currently scattered in the literature. But moreover, we
present and extend these results in a unified mathematical framework that allows 
for a comparison of the different approaches to familial searching. In particular we 
extensively discuss the different frequentist meanings of the probabilities involved.

In this paper, more specifically, we define various search strategies that allow 
one to determine or control the probability with which the relative can be found, 
if indeed it is in the database. We also explain how these search strategies can 
deal with heterogeneous databases in which the amount of information stored for 
its members may differ between members (e.g., different sets of autosomal loci). 
It turns out that there are different strategies that are equally effective, and we 
discuss their different probabilistic interpretations. 

The first method, which we call the conditional method, gives a probabilistic 
model that allows one to obtain posterior probabilities for relatedness that take all 
the database information (and the prior probabilities) into account. A focus on the
posterior can also be found in [2], but indirectly as the result of a Bayesian network
computation, whereas we derive these probabilities without needing such a tool. 

Our second method, which we call the target-centered method, selects a subset 
from the database that contains a relative with a certain probability. It essentially 
weighs false-negative probabilities (not selecting a true relative) against false-positive
probabilities (selecting an unrelated individual). The literature contains accounts of such
approaches (e.g. [7]), but these studies focus on the average over all target profiles,
whereas we focus on what happens in a particular case by taking the target profile 
as starting point. 

We investigate the effectiveness of a familial search for a specific DNA target 
profile, rather than the average over all profiles. As is to be expected, if the target
profile has more rare alleles, its relatives are easier to find than if it has more
common alleles. This effect has already been noted (quantitatively) in [5] and (heuristically)
in [8]. We illustrate the results by a comprehensive simulation study using
artificial targets in the actual Dutch National DNA database. Finally, we look at 
how hard half-siblings would be to find, either with the appropriate index (the half-sibling
index) or with the sibling index. It turns out that half-siblings are, of course,
quite hard to find but that a search with sibling index does not perform very much 
worse than one with the half-sibling index. 

In a familial search setting, we have a target item that we compare with database 
items among which, we suppose, at most one 'related' item is present. If so, we 
wish to find it, and we do so by computing likelihood ratios in favour of being 
a special item between the target and every database member. From these, we 
compute (given prior probabilities) posterior probabilities for relatedness between 
the target and a database member that take all the computed one-to-one likelihood 
ratios into account. 

What distinguishes a familial search from an ordinary kinship computation is of 
course the fact that we compute likelihood ratios with a whole database. Databases 
are bound to yield chance matches, meaning in this case a strong indication for a 
non-existing relationship. This is reminiscent of the classical database controversy, 
and some elements of that discussion reappear in the familial searching context. 
For example, an attempt to take into account the fact that a database search has 
been done is described in [11], where the SWGDAM Ad Hoc Committee on Partial
Matches recommends that the kinship index between a database member and the 
target be divided by N, the size of the database, and to only further investigate 
this possible lead if that quotient is sufficiently large. As we shall see, this number 
does not represent the likelihood ratio in favour of relatedness. 

In our opinion familial searching has the potential to lead to the same kind of 
debate as the classical database controversy. As we will point out in the text, such 
a classical database search can in fact be viewed as a special case in the model that 
we present. 

Finally, let us mention that this paper can be viewed as a sequel to [13], in which
we have discussed the effectiveness of database searches where one is looking for 
a perfect match. In that case, only two likelihood ratios arise as the result of the 
comparison of the target's profile to a database profile: zero (in case of exclusion) or
$1/p$ (in case of a match, where $p$ is the profile frequency). In this article we generalize
this to the search for a database member that has a certain family relation to the 
target (including the possibility of identity). 

2. Database likelihood ratios 

We consider a database of DNA profiles, and we suppose at most one of these 
profiles comes from a person that is related to the offender, whose DNA profile 
we call the target profile. The form of relatedness is fixed: we are looking for a 
specific relative (e.g., the offender's father, the offender's brother, or simply the 
offender himself), not for any relative. We also assume that the database members 
are unrelated to each other. In reality, the database may of course contain several 
related individuals. We expect this to have a negligible influence when these people 
are only related to each other and not to the target. In the case however where 
the database contains more than one relative of the target, the situation becomes 
more complex. Nonetheless it is clear that the target becomes easier to identify in 
that case. Exactly how much easier will depend on the number and type of these 
relatives and warrants a separate publication. In this article we restrict ourselves to 
the case where at most one relative may be present in the database, and we believe 
that the results are indicative of those to be obtained in the more general setting. 

Let $P_1, \ldots, P_N$ represent the DNA profiles in the database $\mathcal{D}$, by letting them be
random variables that are all distributed as either $S$ ("special") or $G$ ("generic"),
such that at most one of the $P_i$ (the "special" member) is distributed as $S$. For
example, if we are looking for brothers then $S$ selects a DNA profile from a brother
of the target, according to the probability distribution for DNA profiles of brothers
of the target. By $R = i$ we mean that $P_i$ is distributed according to $S$, for $1 \le i \le N$.
By $R \notin \mathcal{D}$ we mean that all $P_i$ for $1 \le i \le N$ are distributed according to $G$, i.e., the
database does not contain a special item. Then $R \in \mathcal{D}$ has the obvious meaning.
We assume that we know the "prior" probabilities

$$\pi_i = P(R = i)$$

for all $i$. Further, let $\pi_{\mathcal{D}} = \sum_{i=1}^{N} \pi_i$ and $\pi_0 = 1 - \pi_{\mathcal{D}}$.

We also suppose that all $P_i$ are conditionally independent given which of them
(if any) is distributed as S. This reflects that all population members are unrelated 
to each other and to the offender, except for the relative. 
In our DNA context, the $P_i$ take values in the set of DNA profiles. The distribution
of $S$ will depend on the relation between the offender and the special member.
If we are looking for the offender himself, then S will have a point distribution: it is 
equal to the DNA profile of the offender. If we are looking for the offender's father, 
then S will be equal to DNA profile g with probability equal to the probability that 
the offender's father has DNA profile g. On the other hand, the random variable 
G always represents a DNA profile of a random population member, and therefore 
the probability that G = g is equal to the expected population frequency of DNA 
profile g, irrespective of the type of search we are performing (looking for a direct 
match, a parent or child, a sibling, et cetera).

The likelihood ratio $\mathrm{LR}_{S,G}(P_i)$ expresses the support in favour of $R = i$: it says
how many times more likely it is that database member $i$ has the observed DNA
profile if this member is the desired relative, than if it is unrelated to the offender.
Such a likelihood ratio is often called PI (Paternity Index) when looking for a parent 
or child, and SI (Sibling Index) when looking for a sibling. 

If every database member is subjected to such a likelihood ratio calculation, we
obtain a vector of likelihood ratios, which we denote by

$$\mathrm{LR}_{\mathcal{D}} = (r_1, \ldots, r_N) = r.$$

We also define $r_0 = 1$, which will be useful below.

We will use these likelihood ratios from the database to make probabilistic statements
concerning the identity of $R$. In the next section we will use those results to
define search strategies for R in the database. One strategy relies on the following 
result, whose proof we defer to the Appendix. 



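Proposition 2.1. For $i = 1, \ldots, N$, we have

$$P(R = i \mid \mathrm{LR}_{\mathcal{D}} = r) = \frac{r_i \pi_i}{\sum_{k=1}^{N} r_k \pi_k + \pi_0} \tag{2.1}$$

and

$$P(R = i \mid \mathrm{LR}_{\mathcal{D}} = r, R \in \mathcal{D}) = \frac{r_i \pi_i}{\sum_{k=1}^{N} r_k \pi_k}. \tag{2.2}$$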



Remark 2.2. A version of the above formula (2.1) has, in the context of DNA mixtures,
also been obtained in [3] (see also [4]). Their formulation uses probabilities
rather than likelihood ratios in numerator and denominator. 



Remark 2.3. Note that the conditional probabilities in (2.1) and (2.2) do not
depend on the distribution of S and G. 
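
As a minimal sketch of how the posterior probabilities of Proposition 2.1 could be computed in practice, assume that (2.1) has the form $r_i \pi_i / (\sum_{k} r_k \pi_k + \pi_0)$ with $\pi_0 = 1 - \sum_i \pi_i$. The function name `posterior_probabilities` and the example numbers are illustrative assumptions and do not come from the study itself.

```python
from typing import List, Sequence, Tuple


def posterior_probabilities(
    r: Sequence[float], pi: Sequence[float]
) -> Tuple[List[float], float]:
    """Return (P(R = i | r) for each i, P(R not in database | r))."""
    if len(r) != len(pi):
        raise ValueError("r and pi must have the same length")
    pi_0 = 1.0 - sum(pi)                      # prior mass outside the database
    weights = [r_i * pi_i for r_i, pi_i in zip(r, pi)]
    total = sum(weights) + pi_0               # denominator, using r_0 = 1
    return [w / total for w in weights], pi_0 / total


# Hypothetical example: three database members with prior 0.1 each.
posteriors, outside = posterior_probabilities([10.0, 0.5, 2.0], [0.1, 0.1, 0.1])
```

Only the likelihood ratios and the priors enter the computation, which is the point of Remark 2.3: the distributions of $S$ and $G$ are not needed once the vector $r$ has been observed.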



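It follows from Proposition 2.1 that, for any subset $\mathcal{D}' \subset \mathcal{D}$,

$$P(R \in \mathcal{D}' \mid \mathrm{LR}_{\mathcal{D}} = r) = \frac{\sum_{i \in \mathcal{D}'} r_i \pi_i}{\sum_{k=1}^{N} r_k \pi_k + \pi_0}. \tag{2.3}$$

In particular, with $\mathcal{D}' = \mathcal{D}$ we obtain

$$\frac{P(R \in \mathcal{D} \mid \mathrm{LR}_{\mathcal{D}} = r)}{P(R \notin \mathcal{D} \mid \mathrm{LR}_{\mathcal{D}} = r)} = \frac{\sum_{i=1}^{N} r_i \pi_i}{\pi_0},$$

and the likelihood ratio in favour of $R \in \mathcal{D}$ is given by

$$\frac{P(\mathrm{LR}_{\mathcal{D}} = r \mid R \in \mathcal{D})}{P(\mathrm{LR}_{\mathcal{D}} = r \mid R \notin \mathcal{D})} = \frac{\sum_{i=1}^{N} r_i \pi_i}{\pi_{\mathcal{D}}}. \tag{2.4}$$

Note that the likelihood ratio depends on the prior probabilities.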

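Corollary 2.4. In odds form, we obtain

$$\frac{P(R = i \mid \mathrm{LR}_{\mathcal{D}} = r)}{P(R \ne i \mid \mathrm{LR}_{\mathcal{D}} = r)} = \frac{r_i \pi_i}{\sum_{k=1, k \ne i}^{N} r_k \pi_k + \pi_0},$$

and the likelihood ratio in favour of $R = i$ is given by

$$\frac{P(\mathrm{LR}_{\mathcal{D}} = r \mid R = i)}{P(\mathrm{LR}_{\mathcal{D}} = r \mid R \ne i)} = \frac{r_i (1 - \pi_i)}{\sum_{k=1, k \ne i}^{N} r_k \pi_k + \pi_0}. \tag{2.5}$$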

In the case where the prior distribution of $R$ on $\mathcal{D}$ is uniform, the formulas derived
above simplify, and for convenience we include them here. With $P(R = i) = \pi_{\mathcal{D}}/N$
for all $1 \le i \le N$ we obtain, as a special case of (2.4),

$$\frac{P(\mathrm{LR}_{\mathcal{D}} = r \mid R \in \mathcal{D})}{P(\mathrm{LR}_{\mathcal{D}} = r \mid R \notin \mathcal{D})} = \frac{1}{N} \sum_{i=1}^{N} r_i, \tag{2.6}$$

which is independent of the prior $\pi_{\mathcal{D}}$, contrary to the general case. In this uniform
case, we see that the results $\mathrm{LR}_{\mathcal{D}} = r$ favour $R \in \mathcal{D}$ if and only if the average
likelihood ratio on $\mathcal{D}$ is greater than one.

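The posterior probability that $R = i$ is now equal to

$$P(R = i \mid \mathrm{LR}_{\mathcal{D}} = r) = \frac{r_i \pi_{\mathcal{D}}}{\pi_{\mathcal{D}} \sum_{k=1}^{N} r_k + N(1 - \pi_{\mathcal{D}})},$$

with corresponding likelihood ratio

$$\frac{P(\mathrm{LR}_{\mathcal{D}} = r \mid R = i)}{P(\mathrm{LR}_{\mathcal{D}} = r \mid R \ne i)} = \frac{r_i (N - \pi_{\mathcal{D}})}{\pi_{\mathcal{D}} \sum_{k \ne i} r_k + N(1 - \pi_{\mathcal{D}})}. \tag{2.7}$$

In the even more specific case that $\pi_{\mathcal{D}} = 1$, i.e., the database surely contains a
relative and any of the members can be the relative with equal a priori probability,
we simply get

$$P(R = i \mid \mathrm{LR}_{\mathcal{D}} = r) = \frac{r_i}{\sum_{k=1}^{N} r_k}$$

and

$$\frac{P(\mathrm{LR}_{\mathcal{D}} = r \mid R = i)}{P(\mathrm{LR}_{\mathcal{D}} = r \mid R \ne i)} = \frac{(N - 1)\, r_i}{\sum_{k \ne i} r_k}.$$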

Example 2.5. We can view the process of searching for a match with a DNA
profile as a special case. In that case $R$ corresponds to the trace donor, and $S$ can
only take the value $e_0$, the DNA profile in question. On the other hand $G$ can take
more values, with probabilities given by the profile population frequencies. Let $p$
denote $P(G = e_0)$, the random match probability of the profile $e_0$. In this situation,
$\mathrm{LR}(P_i)$ can take the value $0$ or $1/p$. The variables $P_1, \ldots, P_N$ correspond to the members
of a database in which we look for the profile $e_0$. Suppose that the $i$th database
member is the only one that matches. Then $\mathrm{LR}(P_i) = 1/p$ and all other $\mathrm{LR}(P_k) = 0$
(for $1 \le k \le N$, $k \ne i$), so the likelihood ratio (2.7) in favour of $R = i$ becomes

$$\frac{P(\mathrm{LR}_{\mathcal{D}} = r \mid R = i)}{P(\mathrm{LR}_{\mathcal{D}} = r \mid R \ne i)} = \frac{1}{p} \cdot \frac{N - \pi_{\mathcal{D}}}{N(1 - \pi_{\mathcal{D}})}.$$

If $\pi_{\mathcal{D}} = N/n$ (the population fraction in the database, with $n$ the population size), then
this reduces to $(n - 1)/(p(n - N))$, and the likelihood ratio in favour of $R \in \mathcal{D}$ is given by (cf. (2.4))

$$\frac{P(R = i)}{p\, P(R \in \mathcal{D})} = \frac{1}{Np}.$$

These results are well known; see e.g. [13] and the references therein.
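
The two closed forms in Example 2.5 can be checked with exact rational arithmetic. The sketch below assumes the uniform prior $\pi_i = \pi_{\mathcal{D}}/N$ with $\pi_{\mathcal{D}} = N/n$ and a single matching member; the function names and the values chosen for $p$, $N$ and $n$ are hypothetical.

```python
from fractions import Fraction


def lr_member_is_donor(p: Fraction, N: int, n: int) -> Fraction:
    """Likelihood ratio for R = i when member i is the only match."""
    pi_d = Fraction(N, n)      # prior probability that the donor is in the database
    pi_i = pi_d / N            # uniform prior per database member
    pi_0 = 1 - pi_d            # prior probability that the donor is outside
    r_i = 1 / p                # likelihood ratio of the matching member
    # All other r_k are zero, so only pi_0 remains in the denominator.
    return r_i * (1 - pi_i) / pi_0


def lr_donor_in_database(p: Fraction, N: int, n: int) -> Fraction:
    """Likelihood ratio for R in the database, given a single match."""
    pi_d = Fraction(N, n)
    pi_i = pi_d / N
    return (1 / p) * pi_i / pi_d


# Hypothetical values: match probability, database size, population size.
p, N, n = Fraction(1, 10**9), 100_000, 16_000_000
lr_member = lr_member_is_donor(p, N, n)
lr_database = lr_donor_in_database(p, N, n)
```

Using `Fraction` avoids rounding, so the results can be compared for exact equality with $(n-1)/(p(n-N))$ and $1/(Np)$.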

3. Search strategies 

We will now use these results to define strategies to choose a subset of $\mathcal{D}$ that is as small
as possible and that contains $R$ with a given minimal probability $\alpha$.

3.1. The conditional method. Let $\mathcal{D}^k$ be the subset of $\mathcal{D}$ that corresponds to
the $k$ largest products $r_i \pi_i$ (with some arbitrary rule in case of ties). Furthermore,
we let, for $0 < \alpha < 1$, $k_\alpha$ be the minimal $k$ for which

$$\sum_{j \in \mathcal{D}^k} r_j \pi_j \ge \alpha (r_1 \pi_1 + \cdots + r_N \pi_N),$$

that is, $k_\alpha$ is the smallest $k$ for which the corresponding sum of the likelihood ratios
weighted with the prior probabilities is at least a fraction $\alpha$ of the total weighted
sum. Finally, we write $\mathcal{D}^\alpha$ for $\mathcal{D}^{k_\alpha}$. Note that in order to determine whether or
not $i \in \mathcal{D}^\alpha$, one needs the full vector $r$. Note also that $P(R \in \mathcal{D}^\alpha)$ depends on the
distribution of $S$ and $G$, but $P(R \in \mathcal{D}^\alpha \mid \mathrm{LR}_{\mathcal{D}} = r)$ does not (cf. Remark 2.3). The
distribution of the cardinality of $\mathcal{D}^\alpha$ also depends on the distribution of $S$ and $G$
(and on $N$).
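
The selection rule of the conditional method can be sketched as follows: sort the database members by the product of likelihood ratio and prior, and keep adding members until their weights make up at least a fraction $\alpha$ of the total. The function name `conditional_subset` and the tie-breaking rule (lowest index first) are assumptions made for this illustration; the text leaves the tie rule arbitrary.

```python
from typing import List, Sequence


def conditional_subset(
    r: Sequence[float], pi: Sequence[float], alpha: float
) -> List[int]:
    """Indices of the smallest top-weight set covering a fraction alpha."""
    weights = [r_i * pi_i for r_i, pi_i in zip(r, pi)]
    total = sum(weights)
    if total <= 0.0:
        return []                              # nothing to select
    # Stable sort: ties are broken in favour of the lowest index.
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    selected: List[int] = []
    cumulative = 0.0
    for i in order:
        selected.append(i)
        cumulative += weights[i]
        if cumulative >= alpha * total:        # k_alpha has been reached
            break
    return selected


# Hypothetical likelihood ratios with a uniform prior on four members.
example_r = [8.0, 1.0, 0.5, 0.5]
example_pi = [0.25, 0.25, 0.25, 0.25]
small = conditional_subset(example_r, example_pi, 0.75)
medium = conditional_subset(example_r, example_pi, 0.85)
large = conditional_subset(example_r, example_pi, 0.99)
```

The full vector of likelihood ratios is needed before any member can be selected, because the stopping point depends on the total weighted sum.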

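We now make two observations about the probability that the index $R$ is contained
in $\mathcal{D}^\alpha$. First, in case $P(R \in \mathcal{D}) = 1$ it follows from (2.1) that for the
unconditional probability $P(R \in \mathcal{D}^\alpha)$ we have

$$P(R \in \mathcal{D}^\alpha) \ge \alpha. \tag{3.1}$$

Secondly, if we do not have $P(R \in \mathcal{D}) = 1$, then from (2.2) we have (for $i = 1, \ldots, N$)
that

$$\begin{aligned}
P(R = i \mid R \in \mathcal{D}) &= \sum_{r} P(\mathrm{LR}_{\mathcal{D}} = r \mid R \in \mathcal{D})\, P(R = i \mid R \in \mathcal{D}, \mathrm{LR}_{\mathcal{D}} = r) \\
&= \sum_{r} P(\mathrm{LR}_{\mathcal{D}} = r \mid R \in \mathcal{D})\, \frac{r_i \pi_i}{\sum_{k=1}^{N} r_k \pi_k}.
\end{aligned}$$

Hence

$$\begin{aligned}
P(R \in \mathcal{D}^\alpha \mid R \in \mathcal{D}) &= \sum_{r} P(\mathrm{LR}_{\mathcal{D}} = r \mid R \in \mathcal{D})\, P(R \in \mathcal{D}^\alpha \mid R \in \mathcal{D}, \mathrm{LR}_{\mathcal{D}} = r) \\
&= \sum_{r} P(\mathrm{LR}_{\mathcal{D}} = r \mid R \in \mathcal{D}) \sum_{i \in \mathcal{D}^\alpha} \frac{r_i \pi_i}{\sum_{k=1}^{N} r_k \pi_k} \\
&\ge \alpha \sum_{r} P(\mathrm{LR}_{\mathcal{D}} = r \mid R \in \mathcal{D}) \\
&= \alpha,
\end{aligned}$$

where the inequality follows from the definition of $\mathcal{D}^\alpha$. The quantity

$$P(R \in \mathcal{D}^\alpha \mid R \in \mathcal{D})$$

is called the efficiency of $\mathcal{D}^\alpha$, the ability to select $R$ given that $R$ is in the database.
We just showed that the efficiency of $\mathcal{D}^\alpha$ is at least $\alpha$.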

3.2. The target-centered method. Recall that we have described S as the ran- 
dom variable that selects DNA profiles of relatives, and G as the random variable 
that selects DNA profiles of unrelated individuals. These being random variables, 
the likelihood ratios in favour of relatedness for related individuals (drawn from S) 
and for unrelated individuals (drawn from G) also become random variables. By 
$P(\mathrm{LR}(S) \ge t)$ we mean the probability that the likelihood ratio for a related individual
is at least $t$. For example, when looking for a sibling, this would correspond
to the probability that the sibling index with a random sibling of the offender is at
least $t$. Similarly, it also makes sense to write $\mathrm{LR}(P_i)$: this is the likelihood ratio
obtained from database member $i$.

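For $0 < \alpha < 1$, let $t_\alpha > 0$ be the largest $t$ for which

$$P(\mathrm{LR}(S) \ge t) \ge \alpha. \tag{3.2}$$

We use these thresholds $t_\alpha$ to define

$$\mathcal{D}_\alpha = \{ i \in \mathcal{D} \mid \mathrm{LR}(P_i) \ge t_\alpha \}. \tag{3.3}$$

In order to decide whether or not $i \in \mathcal{D}_\alpha$, one only needs to know $r_i$, and not the
full vector $r$ as in the case of $\mathcal{D}^\alpha$. It follows that

$$P(R \in \mathcal{D}_\alpha \mid R \in \mathcal{D}) = P(\mathrm{LR}(S) \ge t_\alpha) \ge \alpha,$$

so also the efficiency of $\mathcal{D}_\alpha$ is at least $\alpha$. In fact, for every $0 < \alpha < 1$,

$$P(R \in \mathcal{D}_\alpha) \ge \alpha P(R \in \mathcal{D}). \tag{3.4}$$

Thus, we simply choose a threshold that is met by the real relative with probability
at least $\alpha$, and admit anyone into $\mathcal{D}_\alpha$ when the likelihood ratio is sufficiently big.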
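
When the distribution of $\mathrm{LR}(S)$ is only available through simulated relatives of the target, the threshold can be estimated empirically. The sketch below assumes a list of simulated likelihood ratios for true relatives and takes the largest sample value that is reached by at least a fraction $\alpha$ of the sample; the function names and the numbers are hypothetical, and the result is an estimate of the threshold rather than its exact value.

```python
import math
from typing import List, Sequence


def empirical_threshold(relative_lrs: Sequence[float], alpha: float) -> float:
    """Largest t such that at least a fraction alpha of the sample is >= t."""
    ordered = sorted(relative_lrs, reverse=True)
    # Small tolerance guards against floating-point noise in alpha * len(...).
    k = max(1, math.ceil(alpha * len(ordered) - 1e-9))
    return ordered[k - 1]


def target_centered_subset(
    database_lrs: Sequence[float], threshold: float
) -> List[int]:
    """Indices of database members whose likelihood ratio meets the threshold."""
    return [i for i, lr in enumerate(database_lrs) if lr >= threshold]


# Hypothetical simulated sibling indices for true siblings of one target.
simulated = [0.2, 5.0, 50.0, 1.0, 300.0, 12.0, 0.8, 90.0, 3.0, 7.0]
t_alpha = empirical_threshold(simulated, 0.8)
selected = target_centered_subset([0.01, 1.0, 0.3, 25.0, 0.0], t_alpha)
```

Unlike the conditional method, each database member is judged on its own likelihood ratio, so the selection can be made without looking at the rest of the database.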

Remark 3.1. Conversely, one could also select all database members whose likelihood
ratio would be unusually large if they were unrelated. This criterion has been
proposed in [12], in the context of deciding whether or not to further investigate the
possibility that the suspect's relative matches a crime stain. In our terminology,
they propose a threshold $s_\beta$ such that

$$\beta = P(\mathrm{LR}(G) \ge s_\beta).$$

If $N$ is large then (since all but at most one of the database members are distributed
as $G$) one expects a fraction $\beta$ of the database to be selected into $\{ i \in \mathcal{D} \mid \mathrm{LR}(P_i) \ge s_\beta \}$.
The relation with $\mathcal{D}_\alpha$ is as follows. We have

$$\begin{aligned}
\alpha &\le P(\mathrm{LR}(S) \ge t_\alpha) \\
&= \sum_{x \ge t_\alpha} P(\mathrm{LR}(S) = x) \\
&= \sum_{x \ge t_\alpha} x\, P(\mathrm{LR}(G) = x) \\
&= \sum_{x > 0} x\, P(\mathrm{LR}(G) = x \mid \mathrm{LR}(G) \ge t_\alpha)\, P(\mathrm{LR}(G) \ge t_\alpha) \\
&= E(\mathrm{LR}(G) \mid \mathrm{LR}(G) \ge t_\alpha)\, P(\mathrm{LR}(G) \ge t_\alpha),
\end{aligned}$$

where the second equality uses the identity $P(\mathrm{LR}(S) = x) = x\, P(\mathrm{LR}(G) = x)$.
Now, let $\beta$ be such that $t_\alpha = s_\beta$, then

$$\alpha \le \beta \cdot E(\mathrm{LR}(G) \mid \mathrm{LR}(G) \ge s_\beta).$$

Clearly, $\alpha$ cannot be expressed in $\beta$ alone but depends also on the target that we
are dealing with. It follows that when selecting a database subset according to the
threshold $s_\beta$, the probability that $R$ is selected depends on the specific aspects of
the case, whereas for $\mathcal{D}_\alpha$ (and $\mathcal{D}^\alpha$) it has a uniform lower bound $\alpha$.
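
The chain of equalities in Remark 3.1 rests on the fact that the tail probability of the likelihood ratio under $S$ can be computed from the distribution under $G$ alone. A minimal numerical check on a toy example is sketched below; the three "profiles" and their probabilities are invented for illustration, and the support of $S$ is assumed to lie within that of $G$.

```python
from typing import Dict


def tail_under_s(p_s: Dict[str, float], p_g: Dict[str, float], t: float) -> float:
    """P(LR(S) >= t), summing the probabilities of S directly."""
    return sum(ps for g, ps in p_s.items() if ps / p_g[g] >= t)


def tail_via_g(p_s: Dict[str, float], p_g: Dict[str, float], t: float) -> float:
    """The same quantity as the sum of x * P(LR(G) = x) over x >= t."""
    total = 0.0
    for g, pg in p_g.items():
        lr = p_s.get(g, 0.0) / pg      # likelihood ratio of profile g
        if lr >= t:
            total += lr * pg
    return total


# Toy distributions over three hypothetical profiles.
p_g = {"a": 0.5, "b": 0.3, "c": 0.2}       # unrelated individuals
p_s = {"a": 0.25, "b": 0.25, "c": 0.5}     # relatives of the target
direct = tail_under_s(p_s, p_g, 0.8)
indirect = tail_via_g(p_s, p_g, 0.8)
expected_lr_under_g = tail_via_g(p_s, p_g, 0.0)
```

With threshold $0$ the second function returns the expected likelihood ratio of an unrelated individual, which equals one.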

Remark 3.2. Notice that this approach essentially compares false-negative probabilities
($1 - \alpha$) with false-positive probabilities ($P(\mathrm{LR}(G) \ge t_\alpha)$). This is also done in
[7], but without focussing on a specific target profile. The results are therefore informative
about the average performance with likelihood ratio thresholds, not about
results in a particular case. Moreover, various studies (cf. [7], [5]) employ thresholds
that are a combination of a threshold on the number of shared alleles (denoted
IBS, identity by state) and a likelihood ratio threshold $t$. Suppose we define $\mathcal{D}^{n,t}$ as
those database profiles that share at least $n$ alleles with the target profile and for
which the kinship index with the target profile is at least $t$. For a given target, $\mathcal{D}^{n,t}$
will contain a true relative with a certain probability, and will contain unrelated
individuals with another probability. However, the Neyman-Pearson lemma implies
that of all tests with the same false-negative rate, the likelihood ratio test has the
smallest false-positive rate. In other words, the pair $(n, t)$ will correspond to a
false-negative rate $1 - \alpha$, and then $\mathcal{D}_\alpha$, which has this same false-negative rate, will
have a better (more precisely, not a worse) false-positive rate than $\mathcal{D}^{n,t}$. Therefore,
even though it can be computationally attractive to work with a threshold
$(n, t)$, conceptually these thresholds are outperformed by putting a threshold on
the likelihood ratio alone.

3.3. Comparison and interpretation. We have defined two subsets $\mathcal{D}^\alpha$ and
$\mathcal{D}_\alpha$, both with efficiency at least $\alpha$. Nevertheless, there are important differences
between these approaches that we wish to discuss here.

First of all, $\mathcal{D}^\alpha$ makes use of the prior probabilities $\pi_i = P(R = i)$, while $\mathcal{D}_\alpha$
does not. For example, in case of familial searching, geographical information or
age could play a role in the definition of prior probabilities $P(R = i)$. Thus, $\mathcal{D}^\alpha$
uses more information than $\mathcal{D}_\alpha$, which seems to give $\mathcal{D}^\alpha$ an advantage over $\mathcal{D}_\alpha$.

There is, however, a reason why the use of $\mathcal{D}_\alpha$ could be more appropriate in
concrete cases. This reason has to do with the interpretation of the probabilities
involved, and we explain this next. We can see $\mathcal{D}^\alpha$ as a random subset of $\mathcal{D}$ which
contains all database members that have yielded likelihood ratios greater than or
equal to a random threshold. The distribution of this threshold depends on the
distributions of both $S$ and $G$ (and on $N$, the size of the database). Therefore, a
frequentist interpretation requires re-sampling of the database. Indeed, we have
defined a subset in such a way that, if we would construct it for many realizations
of one copy of $S$ among $N - 1$ copies of $G$, a fraction $P(R \in \mathcal{D}^\alpha \mid R \in \mathcal{D})$ of the time
we would have included the copy of $S$.

The interpretation of the probability $P(R \in \mathcal{D}_\alpha)$, on the other hand, is easier.
Indeed, $\mathcal{D}_\alpha$ is a random subset of $\mathcal{D}$ as well, containing all database members that
have yielded likelihood ratios above some threshold, but this time the threshold depends
on the distribution of $S$ only (through $\mathrm{LR}(S)$). This allows us to make another
frequentist interpretation: we choose a realisation of the database (according to $G$),
and then, keeping the database fixed, repeatedly add one copy of a realisation of
$S$. We can think of $P(R \in \mathcal{D}_\alpha \mid R \in \mathcal{D})$ as the relative frequency of times we would
find the special member in $\mathcal{D}_\alpha$. This interpretation corresponds well with what one
would intuitively understand by the probability of finding the relative, since in
forensic practice the database is (more or less) fixed. From this point of view it is
more appropriate to use $\mathcal{D}_\alpha$ rather than $\mathcal{D}^\alpha$ and, importantly, it is also easier to
explain to legal representatives what the probabilistic statement really means.

The frequentist considerations above apply to the general framework we have
discussed in this paper. In the special case of familial searching however, the drawback
of using $\mathcal{D}^\alpha$ may not be that serious, for the following reason. We explained
that for a full frequentist interpretation of $\mathcal{D}^\alpha$, one would need to resample the
database many times, and that this does not correspond well to legal practice.
However, what matters is not so much that we can interpret the full profiles in the
database as being resampled, but that we can interpret the observed likelihood ratios
as being resampled. Suppose that we treat various different familial searching
cases (i.e., try to find relatives of various targets) with the same database. Then,
when we compute likelihood ratios between the database and a new target, these
likelihood ratios depend on the newly sampled target profile, and it is to be expected
that for an independent sequence of target profiles, the observed likelihood
ratios corresponding to the fixed profiles in the database are more or less independent.
To test this, in the next section we investigate using computer simulation to
what extent the frequentist interpretation that we have for $\mathcal{D}_\alpha$ is valid for $\mathcal{D}^\alpha$ as
well. That is, we draw many targets independently according to $G$ (i.e., at random
using population allele frequencies), and add their simulated relatives to the same
database. We see how many of these relatives are found in $\mathcal{D}^\alpha$, on average over
all targets. This we will compare to adding many relatives of the same target to
resampled databases.

Finally, we mention the fact that when the database is large and uniform priors
are used, the sets $\mathcal{D}^\alpha$ and $\mathcal{D}_\alpha$ will be very similar. This is due to the fact that the
law of large numbers implies that the sum of the likelihood ratios above $t_\alpha$ divided by
$N$ will be close to $\alpha$. Hence the random threshold associated with $\mathcal{D}^\alpha$ (discussed
above) will with very high probability be very close to $t_\alpha$. This argument can be
made precise in the form of a limit statement in probability or almost surely.

3.4. Heterogeneous databases. A forensic DNA database may consist of profiles
that have different sets of loci typed. As a result, not all profiles stored in the
database are equally informative. Mathematically this means that instead of one
pair $(S, G)$, we have several such pairs. We now sketch how the search strategies
deal with such a situation. For ease of exposition, we consider only the situation
where there are two such pairs, the generalization to a larger number being obvious.
In the DNA context, this corresponds to a database that contains DNA profiles for
two different sets of loci. We write the database $\mathcal{D}$ as a disjoint union $\mathcal{D} = \mathcal{D}_1 \cup \mathcal{D}_2$,
where $\mathcal{D}_i$ corresponds to the hypothesis random variables $S_i$ and $G_i$.

3.4.1. Conditional method. Going through the proof of Proposition 2.1 one checks
that this expression also holds in this heterogeneous situation, with the understanding
that if $P_i \in \mathcal{D}_j$ then $P_i$ is distributed either as $S_j$ (if $R = i$) or as $G_j$ (if $R \ne i$).
The search strategy therefore need not be modified but its efficiency may change.
For example, as the amount of genetic information in $\mathcal{D}_1$ increases, $S_1$ approaches a
point distribution at infinity. In the limit, if $R \in \mathcal{D}_1$ then $R \in \mathcal{D}^\alpha$ for all $\alpha > 0$ and
hence the efficiency of $\mathcal{D}^\alpha$ is at least $P(R \in \mathcal{D}_1 \mid R \in \mathcal{D})$. Thus, the heterogeneity
of $\mathcal{D}$ has an effect on the efficiency and cardinality of $\mathcal{D}^\alpha$.

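Another possibility is to perform the conditional method on each database part
separately with its own efficiency parameter $\alpha_i$; we may denote the resulting subset
of $\mathcal{D}$ by $\mathcal{D}^{\alpha_1,\alpha_2} = \mathcal{D}_1^{\alpha_1} \cup \mathcal{D}_2^{\alpha_2}$. In that case,

$$P(R \in \mathcal{D}^{\alpha_1,\alpha_2} \mid R \in \mathcal{D}) \ge \alpha_1 P(R \in \mathcal{D}_1 \mid R \in \mathcal{D}) + \alpha_2 P(R \in \mathcal{D}_2 \mid R \in \mathcal{D}).$$

In particular, choosing $\alpha_1 = \alpha_2 = \alpha$ leads to an efficiency of (at least) $\alpha$ for $\mathcal{D}^{\alpha,\alpha}$. Other
choices of $\alpha_1, \alpha_2$ may lead to the same efficiency. If, for example, $\mathcal{D}_1$ contains more
genetic information than $\mathcal{D}_2$, then one may take advantage of this fact by letting
$\alpha_1 > \alpha_2$. This means that the efficiency is greater in $\mathcal{D}_1$ than it is in $\mathcal{D}_2$, while the
overall efficiency of $\mathcal{D}^{\alpha_1,\alpha_2}$ is equal to $\alpha = (\alpha_1 - \alpha_2) P(R \in \mathcal{D}_1 \mid R \in \mathcal{D}) + \alpha_2$. For
these choices $\mathcal{D}^{\alpha,\alpha}$ and $\mathcal{D}^{\alpha_1,\alpha_2}$ have the same efficiency, but the expected cardinality
of $\mathcal{D}^{\alpha_1,\alpha_2}$ can be smaller than that of $\mathcal{D}^{\alpha,\alpha}$.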
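
The trade-off between the two efficiency parameters is simple arithmetic, sketched below. The function names are hypothetical, and the probability that the relative sits in the first database part (given that it is in the database at all) is an assumed input, here set to one third purely for illustration.

```python
def overall_efficiency(alpha_1: float, alpha_2: float, p_part_1: float) -> float:
    """Lower bound on the efficiency when each part has its own parameter."""
    return alpha_1 * p_part_1 + alpha_2 * (1.0 - p_part_1)


def alpha_2_for_target(alpha: float, alpha_1: float, p_part_1: float) -> float:
    """Parameter for part 2 that keeps the overall bound equal to alpha."""
    return (alpha - alpha_1 * p_part_1) / (1.0 - p_part_1)


# Hypothetical choice: a more informative part 1 searched at 0.95.
p_part_1 = 1.0 / 3.0
alpha_2 = alpha_2_for_target(0.85, 0.95, p_part_1)
bound = overall_efficiency(0.95, alpha_2, p_part_1)
```

In this example a stricter search of the informative part allows a lower parameter on the remaining part while the overall bound stays at 0.85.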

3.4.2. Target-centered method. In this case as well, we can make use of the database's
heterogeneity to define various search strategies with the same efficiency.
Let, as in (3.2), $t_{i,\alpha} > 0$ be the largest $t$ for which $P(\mathrm{LR}(S_i) \ge t) \ge \alpha$ (where
$\mathrm{LR} = \mathrm{LR}_{S_i,G_i}$), i.e., we define the appropriate thresholds on the likelihood ratio for
every type of database entry. Then we let

$$\mathcal{D}_{\alpha_1,\alpha_2} = \{ i \in \mathcal{D}_1 \mid \mathrm{LR}(P_i) \ge t_{1,\alpha_1} \} \cup \{ i \in \mathcal{D}_2 \mid \mathrm{LR}(P_i) \ge t_{2,\alpha_2} \},$$

containing a database member if, considering the amount of information stored for
this database member, the obtained likelihood ratio is sufficiently big. Then the
efficiency of this strategy is simply, as it was for $\mathcal{D}^{\alpha_1,\alpha_2}$,

$$P(R \in \mathcal{D}_{\alpha_1,\alpha_2} \mid R \in \mathcal{D}) \ge \alpha_1 P(R \in \mathcal{D}_1 \mid R \in \mathcal{D}) + \alpha_2 P(R \in \mathcal{D}_2 \mid R \in \mathcal{D}).$$

Using the same limit as for the conditional method, as $\mathcal{D}_1$ contains more information,
an efficiency of $\alpha_1 = 1$ can be obtained in the limit using $t_{1,\alpha_1} = \infty$, in which
case the efficiency of $\mathcal{D}_{\alpha_1,\alpha_2}$ will be at least $P(R \in \mathcal{D}_1 \mid R \in \mathcal{D})$, the same as for the
conditional method.

4. Familial Searching in the Dutch National DNA Database

4.1. Methods and notation. We have carried out simulation experiments using 
the Dutch DNA database, where we have carried out familial searches with artificial 
targets, looking for parent-child, sibling, and half-sibling relations. We will restrict 
ourselves mostly to the results of the sibling and half-sibling searches here, parents 
and children being relatively easy to find owing to the fact that they always share 
an allele on each locus (barring mutations, but these are rare) whereas siblings (and
of course half-siblings too) need not share any alleles with each other.

All our simulations were programmed in-house with Mathematica software. We
let $\mathcal{D}_{NL}$ be the Dutch National DNA Database (as of mid-2010, with all duplicate
profiles removed and considering only the $N = 99{,}979$ profiles for which all ten
SGMPlus loci were typed). Allelic ladders and allele frequencies were taken from
$\mathcal{D}_{NL}$.

According to these allele frequencies, target profiles $C_1, \ldots, C_{100}$ were sampled
(pseudo)randomly to serve as the targets whose relatives we want to find using
familial searching. For each of these target profiles we sampled 50,000 children
and 50,000 siblings. Then we computed the likelihood ratios in favour of paternity
(the Paternity Index PI) between the $C_i$ and their children, and those in favour of
siblingship (the Sibling Index SI) between the $C_i$ and their siblings. These allow
us to estimate the thresholds $t_\alpha$ (cf. (3.2)) for the paternity and sibling cases.

We will sometimes write KI, for Kinship Index, when we mean that the discussion
holds for any type of relative; in particular, KI can stand for PI or SI.

The DNA profiles in $\mathcal{D}_{NL}$ are labeled $d_1, \ldots, d_N$; they can be viewed as a sample
of independent copies of $G$ that is fixed throughout. By $KI(C_i, d_j)$ we mean
the kinship index between the target profile $C_i$ and database profile $d_j$. Thus,
$KI(C_i, d_j)$ can, for each target separately, be interpreted as a realization $r_j$ of the
random variables $\mathrm{LR}(G)$ in the preceding sections.

We have also computed the random match probability (RMP) of each target
profile. On a locus with alleles $(a, b)$, the RMP is equal to $p_a p_b (2 - \delta_{a,b})$, where $p_i$
is the allele frequency of allele $i$ and $\delta_{a,b} = 0$ if $a \ne b$ and $\delta_{a,a} = 1$. The RMP of a
DNA profile is then the product over all involved loci, since we assume all loci to
be independent. This is reasonable, since all loci are on different chromosomes.
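
The random match probability described above can be sketched in a few lines. The data layout (one genotype pair and one allele-frequency table per locus), the function names and the allele frequencies are assumptions made for this illustration; they are not the frequencies used in the study.

```python
from typing import Dict, Sequence, Tuple

Genotype = Tuple[str, str]


def locus_rmp(genotype: Genotype, freqs: Dict[str, float]) -> float:
    """p_a * p_b * (2 - delta_ab) for a single locus."""
    a, b = genotype
    factor = 1 if a == b else 2     # homozygote: p_a^2, heterozygote: 2 p_a p_b
    return freqs[a] * freqs[b] * factor


def profile_rmp(
    profile: Sequence[Genotype], allele_freqs: Sequence[Dict[str, float]]
) -> float:
    """Product of the per-locus values, assuming independent loci."""
    rmp = 1.0
    for genotype, freqs in zip(profile, allele_freqs):
        rmp *= locus_rmp(genotype, freqs)
    return rmp


# Hypothetical two-locus profile with invented allele frequencies.
frequencies = [{"12": 0.2, "13": 0.3}, {"9": 0.1, "10": 0.4}]
rmp = profile_rmp([("12", "13"), ("9", "9")], frequencies)
```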



4.2. Total likelihood ratio with the database. For all 100 targets $C_i$, we computed
the sums $|KI(C_i, \mathcal{D}_{NL})| = \sum_{k=1}^{N} KI(C_i, d_k)$. The mean $|PI(C_i, \mathcal{D}_{NL})|$ was
102,200 (with sample standard deviation 94,500), the mean $|SI(C_i, \mathcal{D}_{NL})|$ was
93,500 (with sample standard deviation 42,200). These results seem consistent
with what we expect, cf. Proposition A.2. Indeed, since all target profiles were
randomly generated, they do not have a true relative in the database, and hence
$E(|KI(C_i, \mathcal{D}_{NL})|) = N$ according to Proposition A.2.



4.3. Conditional method. First, we have compared the results of the conditional 
method when each familial search is performed in the same database (i.e., the Dutch 
National DNA database), with results obtained when each familial search is carried 
out in a new database. 



4.3.1. The conditional method in the Dutch National DNA Database. We have
investigated (by simulation) what the probability is that a relative of a fixed target is
found in $\mathcal{D}^{\alpha}_{NL}$, where $\mathcal{D}_{NL}$ is here extended with a relative of the considered
target. We take a uniform prior distribution of $R$ on $\mathcal{D}_{NL}$.

To do so, we have simulated relatives $R_{i,j}$ ($i = 1, \dots, 100$; $j = 1, \dots, 500$) of each
type (children and siblings), where $R_{i,j}$ is a relative of target profile $C_i$. For each
relative $R_{i,j}$, define its rank $k_{i,j}$ to be equal to $k$ if and only if there are exactly $k - 1$
database members that have a greater kinship index with $C_i$ than $R_{i,j}$. We also
define

$$t_{i,j} = \frac{\sum_{x \,:\, KI(C_i, d_x) > KI(C_i, R_{i,j})} KI(C_i, d_x)}{KI(C_i, d_1) + \cdots + KI(C_i, d_N) + KI(C_i, R_{i,j})}.$$

Assuming a uniform prior of $R$ on $\mathcal{D}_{NL}$, $t_{i,j}$ is the greatest $t \geq 0$ such that
$R_{i,j} \notin \mathcal{D}^{t}_{NL}$. Thus, $R_{i,j} \in \mathcal{D}^{\alpha}_{NL}$ if and only if $\alpha > t_{i,j}$.
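The rank and the quantity $t_{i,j}$ defined above can be computed directly from the list of kinship indices. The sketch below assumes a uniform prior, as in the text; the function names and the toy kinship indices are hypothetical.

```python
def rank_and_t(db_ki, relative_ki):
    """Return (rank, t) for a relative added to a database.

    db_ki:       kinship indices KI(C_i, d_x) of the N database members
    relative_ki: kinship index KI(C_i, R_ij) of the added relative
    """
    # Database members with a strictly greater kinship index than the relative.
    above = [ki for ki in db_ki if ki > relative_ki]
    rank = len(above) + 1
    # Denominator: all N database members plus the relative itself.
    total = sum(db_ki) + relative_ki
    t = sum(above) / total
    return rank, t

def in_conditional_subset(t, alpha):
    """The relative belongs to the alpha-subset if and only if alpha > t."""
    return alpha > t

# Toy example with five database members (values are made up).
db_ki = [0.0, 0.5, 2.0, 30.0, 1.5]
rank, t = rank_and_t(db_ki, relative_ki=16.0)
```

Here one database member outranks the relative, so the rank is 2 and $t$ equals 30/50; the relative is therefore included for $\alpha = 0.7$ but not for $\alpha = 0.6$.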

For each $C_i$ and for $\alpha \in \{0.01, \dots, 0.99, 1\}$, we have compared $\alpha$ to the fraction
$\beta_{i,\alpha}$ of the $t_{i,j}$ that are smaller than $\alpha$; this fraction $\beta_{i,\alpha}$ is the observed probability
for relatives of $C_i$ to be in $\mathcal{D}^{\alpha}_{NL}$. Finally, we have also computed $\beta_\alpha$ as the average
over all $\beta_{i,\alpha}$. Thus, $\beta_\alpha$ estimates the probability that, if one adds a relative $R$ of a
random target profile to this database $\mathcal{D}_{NL}$, $R$ is in $\mathcal{D}^{\alpha}_{NL}$.

The probability that the relative of a target $C$ is in $\mathcal{D}^{\alpha}_{NL}$ is called the probability
of detection (POD) for $C$ in the Dutch National Database $\mathcal{D}_{NL}$. The number $\beta_\alpha$
therefore gives an estimate of the average (over all targets) probability of detection
POD. Note that a POD is only defined in connection to a fixed database, in this
case $\mathcal{D}_{NL}$.
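The estimates $\beta_{i,\alpha}$ and $\beta_\alpha$ are simple empirical fractions of the simulated $t_{i,j}$. A minimal sketch, with hypothetical function names and made-up $t$-values for two targets:

```python
def observed_pod(t_values, alphas):
    """For one target: fraction of simulated t-values strictly below each alpha."""
    n = len(t_values)
    return [sum(t < a for t in t_values) / n for a in alphas]

def average_pod(t_by_target, alphas):
    """Average of the per-target fractions over all targets."""
    per_target = [observed_pod(ts, alphas) for ts in t_by_target]
    return [sum(column) / len(per_target) for column in zip(*per_target)]

# Grid alpha = 0.01, 0.02, ..., 1.00 as in the text.
alphas = [k / 100 for k in range(1, 101)]

# Made-up t-values for two targets with four simulated relatives each.
t_by_target = [
    [0.0, 0.0, 0.3, 0.8],
    [0.0, 0.1, 0.5, 0.95],
]
beta = average_pod(t_by_target, alphas)
beta_at_half = beta[49]  # alphas[49] == 0.5
```

Because a relative with $t_{i,j} = 0$ (rank 1) is counted for every $\alpha > 0$, the estimated curve starts above the diagonal whenever some relatives rank first.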

For siblings, the result of our simulations is displayed in Figure 1.

Figure 1. The average POD as a function of $\alpha$, averaged over 100
target profiles, Sibling Index. (Vertical axis: POD.)

For all $\alpha$ the average POD of $\mathcal{D}^{\alpha}_{NL}$ is at least $\alpha$. For small $\alpha$, it exceeds $\alpha$
substantially and as $\alpha$ increases, the average POD of $\mathcal{D}^{\alpha}_{NL}$ approaches $\alpha$. This is a
consequence of the definition of $\mathcal{D}^{\alpha}_{NL}$ as being a subset that contains the $k$ greatest
SI for some $k$. As $\alpha$ increases and $\mathcal{D}^{\alpha}_{NL}$ becomes larger, we add individuals with
smaller SI, and we expect $\beta_\alpha$ to become closer to $\alpha$. However, the variation between
target profiles was substantial. We highlight three very different results in Figure 2.
This difference is created by the presence (or absence) of database members that
have a large sibling index with the target, thus obscuring (or revealing more easily)
the real sibling, especially for low probabilities of detection.

Figure 2. The POD as a function of $\alpha$, for three target profiles.

4.3.2. The conditional method in resampled databases. In this section we compare
the above estimates of the probabilities of detection with estimates of the efficiency
of the conditional method in resampled databases of roughly the same size as $\mathcal{D}_{NL}$.
To do so we have, for each target profile $C_i$ as above, simulated 100 relatives $R'_{i,j}$ and
databases $\mathcal{D}'_{i,j}$ with $N$ = 100,000. As in the previous section, we have determined
$t_{i,j}$ as the largest $\alpha$ such that $R'_{i,j} \notin \mathcal{D}'^{\,\alpha}_{i,j}$, and used these numbers to determine
$\beta'_{i,\alpha}$ and $\beta'_\alpha$, whose definitions are analogous to their earlier counterparts.

The observed overall efficiency $\beta'_\alpha$ is extremely close to the observed probabilities
of detection $\beta_\alpha$ displayed in Figure 1. In fact, the difference between the $\beta'_\alpha$ of
this section (the average efficiency) and the $\beta_\alpha$ in Figure 1 (average probability of
detection) is on average over $\alpha \in \{0.01, \dots, 0.99, 1\}$ equal to $-0.0022$ and never
greater (in absolute value) than 0.0084.

We have also, for each $R'_{i,j}$, computed its rank $k_{i,j}$ (defined in Section 4.3.1).
A summary of the results is presented in Figure 3, where we plot, for each
$\alpha \in \{0, 0.01, \dots, 0.99\}$, the average rank of relatives for which $t_{i,j}$ is nearest to $\alpha$. This
gives the average rank of a relative that would be found in $\mathcal{D}^{\alpha}$, but not in $\mathcal{D}^{\beta}$ for
$\beta < \alpha$.



Figure 3. Observed pairs $(t_{i,j}, \log_{10}(\text{Mean } k_{i,j}))$. (Vertical axis: Mean $\log_{10}(k_{i,j})$.)

4.3.3. Comparison of simulations. The simulations indicate that, even though there
is a conceptual difference between probability of detection and efficiency, the average
probability of detection in the Dutch National DNA Database coincides with the
average efficiency of $\mathcal{D}^{\alpha}$ for databases of the same size. In other words, the average
probability of detection in $\mathcal{D}_{NL}$, averaged over all targets, is the same as the average
efficiency of $\mathcal{D}^{\alpha}$ (averaged over all targets) for a database $\mathcal{D}$ of the same size as
$\mathcal{D}_{NL}$. Therefore, even if we (as we do in practice) do not resample the database
but look for relatives in the same database for all targets, the probability of
finding a relative in $\mathcal{D}^{\alpha}$ when database and relative are resampled is the same as
the long term success rate of finding the relative of varying targets in $\mathcal{D}^{\alpha}$ while the
database is kept constant. On the other hand, for $\mathcal{D}_{\alpha}$, the interpretation of the
efficiency $P(R \in \mathcal{D}_{\alpha} \mid R \in \mathcal{D})$ does not require resampling of the database, hence
also holds in a fixed one. This makes the $\mathcal{D}_{\alpha}$ method easier to interpret.

4.4. Target-centered method. Recall that we have, for each of the target profiles
$C_i$, simulated 50,000 siblings, in order to estimate the profile-dependent thresholds
$t_\alpha$ needed for $\mathcal{D}_{\alpha}$. Some resulting sizes of $\mathcal{D}_{NL,\alpha}$, for $\alpha$ = 0.70, 0.80, 0.90, are plotted
in Figure 4, where each dot represents a target profile. The horizontal axis contains
$-\log_{10}(RMP)$, with RMP the random match probability (cf. end of 4.1). The
mean sizes of $\mathcal{D}_{\alpha}$ are 85, 258, 1038 respectively, and we notice a tendency for $\mathcal{D}_{\alpha}$
to be smaller for profiles with a smaller random match probability (i.e. for which
$-\log_{10}(RMP)$ is larger). This is to be expected: for such profiles $t_\alpha$ will be greater
and it will be more unlikely for an unrelated person to have a sibling index with
the target exceeding that threshold. The observed mean sizes of $\mathcal{D}_{NL,\alpha}$ are similar
to what has been obtained in resampled databases with the conditional method (cf.
Figure 3).
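The target-centered procedure described above can be sketched as follows. The sketch assumes that $t_\alpha$ is read as the largest value such that at least a fraction $\alpha$ of the simulated relatives have a kinship index of at least $t_\alpha$, and that a database member is admitted when its kinship index reaches that threshold; the precise definition is the one in (3.2), which is not reproduced here. Function names and numbers are hypothetical.

```python
import math

def estimate_threshold(relative_kis, alpha):
    """Empirical threshold from simulated relatives of one target.

    Assumed reading: the largest simulated value t such that at least a
    fraction alpha of the simulated relatives has KI >= t.
    """
    ordered = sorted(relative_kis, reverse=True)
    # Number of simulated relatives that must reach the threshold; the small
    # tolerance guards against floating-point noise in alpha * n.
    m = max(1, math.ceil(alpha * len(ordered) - 1e-9))
    return ordered[m - 1]

def target_centered_subset(db_kis, threshold):
    """Indices of database members whose KI reaches the threshold."""
    return [x for x, ki in enumerate(db_kis) if ki >= threshold]

# Made-up kinship indices of ten simulated siblings of one target.
simulated = [0.2, 1, 5, 20, 100, 400, 900, 3000, 8000, 50000]
t_alpha = estimate_threshold(simulated, alpha=0.8)

# Made-up kinship indices of six database members with the same target.
db_kis = [0.0, 0.3, 7.0, 0.01, 5.0, 2.5]
subset = target_centered_subset(db_kis, t_alpha)
```

Note that the threshold depends only on the target profile (through the simulated relatives), which is why the expected size of the subset can be assessed before any comparison with the actual database.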

Figure 4. Size of $\mathcal{D}_{NL,\alpha}$ for $\alpha$ = 0.70, 0.80 and 0.90, Sibling Index.
(Horizontal axis: $-\log_{10}(RMP)$.)



4.5. Rank in Dutch database. Another possibility for a forensic lab is to fix in
advance the number of individuals that are additionally tested. Especially when
doing so, it is interesting to know how high one expects the relative to rank when
the database is sorted according to decreasing kinship index. For each of the target
profiles $C_i$, we have generated 1000 siblings and investigated what their rank would
be if they were added to the Dutch National DNA database. The result,
averaged over all targets, is displayed in Figure 5. This graph should be compared
to Figure 3.

In particular, the true sibling was ranked first in 31% of the cases, in the top 10 
in about 52% of the cases, in the top 100 in about 73% of the cases, and in the top 
200 in about 79% of the cases. 
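Percentages of this kind are obtained by counting how many simulated relatives reach a rank of at most $n$. A minimal sketch with a hypothetical function name and made-up ranks (not the ranks observed in the Dutch database):

```python
def top_n_fractions(ranks, cutoffs):
    """Fraction of simulated relatives whose rank is at most n, for each cutoff n."""
    total = len(ranks)
    return {n: sum(r <= n for r in ranks) / total for n in cutoffs}

# Made-up ranks of ten simulated siblings.
ranks = [1, 1, 3, 12, 150, 2500, 1, 8, 90, 40]
fractions = top_n_fractions(ranks, cutoffs=[1, 10, 100, 200])
```

A lab that fixes the number of additionally tested individuals at $n$ can read the corresponding success probability from such a table.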

Of course, substantial differences were observed between the target profiles. For
two target profiles with very different results, we have generated 10,000 siblings for
each of them. The results are visualized in Figure 6. These profiles are the same
ones that give rise to the first two figures of Figure 2. The profile corresponding
to the upper graph in Figure 6 is, for a SGMPlus profile, rare (its random match
probability being $10^{-17.8}$) whereas the other profile is quite common (its random
match probability being equal to $10^{-11.4}$).

Figure 5. Probability for a sibling to obtain SI-rank at most n,
in Dutch database. (Vertical axis: fraction of sibs in top-n, in %.)

Figure 6. Probability for a sibling to obtain SI-rank at most n,
in Dutch database: two extreme cases and the average. (Vertical axis:
fraction of sibs in top-n, in %.)

Notice that for the profile corresponding to the upper graph, the probability
for a sibling to rank in the top-5 is 67%, whereas for the profile corresponding
to the lower graph the probability to rank in the top-300 is slightly less, 64%.
This illustrates the dramatic differences between different familial searches when it
comes to the ease of retrieving a sibling from the database. The middle graph is
the average also displayed in Figure 5.

4.6. Half-siblings. Finally we investigate how hard it is to find half-siblings
in the Dutch DNA database. It is well known that, using independent
autosomal markers as we do here, it is impossible to differentiate between
half-siblings, grandparent-grandchild, and uncle-nephew relationships (at least, in the
absence of mutation but in practice also when mutation is taken into account with
realistic mutation rates). Thus the discussion holds in fact for all these types of
relatives, but we will speak of half-siblings only.

Since it is to be expected that half-siblings that genetically look sufficiently like 
full siblings will be found when a familial search for full siblings is performed, 
the question is how many half-siblings are found in such a familial search for full 
siblings. 

We created an artificial database $\mathcal{D}$ with 100,000 SGMPlus profiles (10 loci),
and considered 100 artificial target profiles. For each of these, we added 500 times
a half-sibling to the database and then sorted the extended database according to
decreasing SI or HSI with the target.

The distribution of the ranks that the half-siblings obtained when the database
was sorted according to decreasing SI or according to decreasing HSI had the
behaviour displayed in Figure 7. As the graph indicates, half-siblings are more
efficiently found with a HSI-based ranking than with a SI-based ranking, but
nevertheless the SI-list does not perform very poorly.



Figure 7. Probability (%) for a half-sibling to obtain SI-rank and
HSI-rank at most n, in SGMPlus database, $N$ = 100,000.




We also performed a simulation in which half-siblings were added to a database
consisting of 100,000 artificial 15-locus NGM profiles. The result, which is similar,
is displayed in Figure 8 below. We conclude from this that, if the database contains
a half-sibling but is only searched for siblings, then there is a non-negligible
probability that this half-sibling will be found in a sufficiently high position in
the SI-ranked list for it to be detectable. When additional DNA tests are done,
one should be aware of this possibility; the fact that target and database member
have different Y-STR profiles may then be used to infer that they cannot be full
siblings, but it must be taken into account that they may nonetheless still be
related as half-siblings, uncle-nephew, or (perhaps less likely in a DNA database)
grandparent-grandchild.




5. Discussion 

We have defined and investigated several search strategies that construct a subset
from a database with DNA profiles in such a way that it contains the relative of
the target, if present in the database, with a certain minimum probability.
We end this article with a brief recapitulation of their main properties
and how these compare to other research published on familial searching. First,
there is the option to use or not to use prior probabilities. Using prior probabilities
has the advantage of also obtaining posterior probabilities, as specified in
Proposition 2.1. It turns out that ranking according to posterior probability is the same
as ranking according to the a priori probability multiplied by the likelihood ratio.
Hence, in the case of a uniform prior, the database can simply be ranked according 
to likelihood ratio. This is, of course, well known. What can also be derived from 
this article however, is how much genetic similarity one expects by chance. Indeed, 
whatever the database's size, the type of relative, or the number of loci included 
in the comparison, statistically one expects the sum of the likelihood ratios with 
all database members to be equal to the number of people in the database, if it 
does not contain a relative. This is also true for direct matches, but in that case 
databases are too small compared to the obtainable likelihood ratios to see this 
effect. In contrast, when looking for relatives, our simulation study shows that 
for a database with 100,000 members and ten investigated loci, the observed total 
likelihood ratio agrees well on average with this prediction. In casework, this can
give a practical check: if the sum of likelihood ratios is of a magnitude comparable
to the database size, there is no strong indication for the presence of the target's
relative. It does not matter if this total is obtained by one large likelihood ratio or
by several smaller ones.

If one does not use prior probabilities, then it is still possible to keep the prob- 
ability with which the relative is detected under control by using the fact that the 
likelihood ratios between the target and his relatives have a distribution that can 
be derived from the target's DNA profile. This means that a lab that carries out 
familial searching can choose to fix either the desired probability of detection, the 
minimum likelihood ratio that warrants additional testing, or the number of addi- 
tionally tested individuals. This also is of course well known, but with the results 
presented here one is able to derive the unfixed two parameters from the fixed one. 
For example, a lab may decide to guarantee a certain probability of detection. If 
it does so, the likelihood ratio threshold for additional DNA testing as well as the 
number of additionally tested individuals will vary from case to case. They can be 
computed or estimated before a search is carried out, which can be helpful in the 
phase where the familial search details are discussed with the investigating
authorities, as it will give a clear picture of what to expect. On the other hand, a lab may
decide to additionally test a fixed number of individuals in every familial search. If
it does so, then the results presented here allow one to say what the probability is, in
a specific case, that a relative ranks sufficiently high.

We have also discussed the somewhat subtle relation between these two approaches.
They are different, yet can have the same efficiency. The reason behind
this is that the target-centered method uses other information: admission of a
database member into $\mathcal{D}_{\alpha}$ is on the basis of the likelihood ratio of that database
member alone, whereas for admission into $\mathcal{D}^{\alpha}$ the product of that likelihood ratio
with the a priori probability has to be sufficiently big, in comparison to that of
the others. The probabilities for the relative to be found by both strategies have a
different frequentist interpretation: the conditional method uses more information
but needs resampled databases for its frequentist interpretation, and the
target-centered method uses less information but does not need resampled databases.

Finally, a good question is if one method seems preferable over another. There 
does not seem to be, in our opinion, a definitive argument to prefer any of the 
methods over the other one. As just pointed out, the conditional method has 
the advantage of being able to deal with prior probabilities. Hence, if these are 
definable, it makes sense to prefer this method. On the other hand, its probabilistic 
interpretation is more complex. If no prior odds exist, then the target-centered 
method has the advantage of having naturally interpretable probabilities, and the 
expected outcome of a search can be simulated beforehand, since it only depends on 
the size of the database and the amount of genetic information that it contains. It is 
therefore very suitable to perform a case pre-assessment: one can obtain estimates 
of how many individuals must be additionally tested, if one looks for a certain 
type of relative with a certain probability. Indeed, all that is needed is the target 
profile, so no comparisons with the actual database are required. Prior to having 
obtained permission for a familial search, the target-centered method therefore 
allows a feasibility study. Once permission has been obtained and the likelihood 
ratios with the database have been computed, one can then use the conditional 
method to take these into account. This allows one, for example, to make statements
about the probability that a relative is found in the particular top-n at hand, as
opposed to what would a priori be expected, e.g. based on the results of Section 4.

Appendix A. 



In this appendix we provide the proof of Proposition 2.1. Before doing so, we first
state some properties of general likelihood ratios which are probably well known, 
but since we are unaware of an explicit reference, we include them with proofs as 
well. 

Proposition A.1. Suppose that for all $e \in E$ we have

$$P(S = e) > 0 \implies P(G = e) > 0. \tag{A.1}$$

Then we have, for all $x > 0$,

$$\frac{P(\mathrm{LR}(S) = x)}{P(\mathrm{LR}(G) = x)} = x.$$

Proof. Denote the part of $E$ on which the likelihood ratio takes value $x$ by

$$E_x = \{e \in E \mid \mathrm{LR}(e) = x\}.$$

We write

$$P(\mathrm{LR}(S) = x) = \sum_{e \in E_x} P(S = e) = \sum_{e \in E_x} \mathrm{LR}(e)\, P(G = e) = x \sum_{e \in E_x} P(G = e) = x\, P(\mathrm{LR}(G) = x).$$

□



Proposition A.2. Under assumption (A.1) we have $E(\mathrm{LR}(G)) = 1$.

Proof. We write

$$E(\mathrm{LR}(G)) = \sum_{e \in E} \mathrm{LR}(e)\, P(G = e) = \sum_{e \in E} P(S = e) = 1.$$

□
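Propositions A.1 and A.2 can be checked exactly on a small discrete example. The sketch below uses a hypothetical five-point evidence space with made-up distributions for $S$ and $G$ that satisfy (A.1), and exact rational arithmetic so that the identities hold without rounding error.

```python
from collections import defaultdict
from fractions import Fraction as F

# Hypothetical distributions of S and G on E = {e1, ..., e5}; both sum to 1,
# and P(S = e) > 0 implies P(G = e) > 0, as required by (A.1).
p_S = {"e1": F(1, 2), "e2": F(1, 4), "e3": F(1, 8), "e4": F(1, 8), "e5": F(0)}
p_G = {"e1": F(1, 8), "e2": F(1, 8), "e3": F(1, 16), "e4": F(1, 8), "e5": F(9, 16)}

# Likelihood ratio LR(e) = P(S = e) / P(G = e).
lr = {e: p_S[e] / p_G[e] for e in p_G}

# Proposition A.2: the expected likelihood ratio under G.
expected_lr_under_G = sum(lr[e] * p_G[e] for e in p_G)

def lr_distribution(p):
    """Distribution of LR(X) when X has distribution p on E."""
    dist = defaultdict(F)
    for e, prob in p.items():
        dist[lr[e]] += prob
    return dict(dist)

dist_S = lr_distribution(p_S)
dist_G = lr_distribution(p_G)

# Proposition A.1: P(LR(S) = x) / P(LR(G) = x) for every attained x > 0.
ratios = {x: dist_S[x] / dist_G[x] for x in dist_G if x > 0}
```

In this example two outcomes share the likelihood ratio 2, so the set $E_x$ in the proof of Proposition A.1 is not a singleton for $x = 2$.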



Proposition A.2 can be interpreted as expressing that for every choice of likelihood
ratio, there will always be chance matches (likelihood ratios in favour of $S$
whereas the data were generated by $G$), as long as (A.1) holds. Moreover, if we
expect fewer chance matches, then these matches will be stronger to the effect that
the expected likelihood ratio is constant.

Proof of Proposition 2.1. Recall that the $P_i$ are independent random variables
distributed either as $S$ or as $G$, with exactly one of them distributed as $S$. The database
consists of individuals $1, \dots, N$. We write $D = (P_1, \dots, P_N)$ for the corresponding
random vector, and

$$\mathrm{LR}_D = (\mathrm{LR}(P_1), \dots, \mathrm{LR}(P_N))$$

for the random vector representing the likelihood ratios that we obtain from the
database.



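With this notation, we first prove (2.1). The required probability is equal to

$$\frac{P(\mathrm{LR}_D = r \mid R = i)\, P(R = i)}{P(\mathrm{LR}_D = r)},$$

which can be expanded as

$$\frac{P(\mathrm{LR}_D = r \mid R = i)\, P(R = i)}{\sum_{k=1}^{N} P(\mathrm{LR}_D = r \mid R = k)\, P(R = k) + P(\mathrm{LR}_D = r \mid R \notin \mathcal{D})\, P(R \notin \mathcal{D})}.$$

Therefore, it is more attractive to consider the reciprocal, and we obtain

$$\frac{1}{P(R = i \mid \mathrm{LR}_D = r)} = \sum_{k=1}^{N} \frac{P(\mathrm{LR}_D = r \mid R = k)}{P(\mathrm{LR}_D = r \mid R = i)} \frac{P(R = k)}{P(R = i)} + \frac{P(\mathrm{LR}_D = r \mid R \notin \mathcal{D})}{P(\mathrm{LR}_D = r \mid R = i)} \frac{P(R \notin \mathcal{D})}{P(R = i)}.$$

Recall that all $P_i$ are conditionally independent given $R$. This means that the last
expression reduces to

$$\sum_{k=1}^{N} \frac{P(\mathrm{LR}(P_i) = r_i \mid R = k)}{P(\mathrm{LR}(P_i) = r_i \mid R = i)} \frac{P(\mathrm{LR}(P_k) = r_k \mid R = k)}{P(\mathrm{LR}(P_k) = r_k \mid R = i)} \frac{P(R = k)}{P(R = i)} + \frac{P(\mathrm{LR}(P_i) = r_i \mid R \notin \mathcal{D})}{P(\mathrm{LR}(P_i) = r_i \mid R = i)} \frac{P(R \notin \mathcal{D})}{P(R = i)}.$$

We claim that for $k \neq i$, we have

$$\frac{P(\mathrm{LR}(P_i) = r_i \mid R = k)}{P(\mathrm{LR}(P_i) = r_i \mid R = i)} = \frac{1}{r_i} \tag{A.2}$$

and

$$\frac{P(\mathrm{LR}(P_k) = r_k \mid R = k)}{P(\mathrm{LR}(P_k) = r_k \mid R = i)} = r_k. \tag{A.3}$$

To see this, note that given $R = k$, $P_k$ is distributed as $S$, and $P_i$ is distributed as
$G$, and then use Proposition A.1. For $k = i$, the corresponding term in the sum is
equal to 1. We can also apply Proposition A.1 to the last term since $R \notin \mathcal{D}$ implies
that $P_i$ is distributed as $G$. From all this it follows that

$$\frac{1}{P(R = i \mid \mathrm{LR}_D = r)} = \sum_{k=1}^{N} \frac{r_k\, P(R = k)}{r_i\, P(R = i)} + \frac{1}{r_i} \frac{P(R \notin \mathcal{D})}{P(R = i)},$$

and (2.1) follows.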
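Inverting the reciprocal derived in the proof gives $P(R = i \mid \mathrm{LR}_D = r) = r_i P(R = i) / \bigl(\sum_k r_k P(R = k) + P(R \notin \mathcal{D})\bigr)$. The following sketch evaluates this expression; the function name, the likelihood ratios and the prior are hypothetical and serve only as an illustration.

```python
def posterior_probabilities(lrs, priors, prior_absent):
    """Posterior probabilities that each database member is the relative.

    lrs:          likelihood ratios r_1, ..., r_N with the target
    priors:       prior probabilities P(R = k) for the N database members
    prior_absent: prior probability that the relative is not in the database
    Returns (posteriors, posterior_absent).
    """
    weights = [r * p for r, p in zip(lrs, priors)]
    # Under "relative absent", every member is distributed as G, which
    # contributes a factor 1 relative to the common normalisation.
    denom = sum(weights) + prior_absent
    return [w / denom for w in weights], prior_absent / denom

# Made-up example: four database members, uniform prior 0.2 each,
# and prior probability 0.2 that the relative is not in the database.
lrs = [40.0, 1.0, 0.5, 0.5]
priors = [0.2, 0.2, 0.2, 0.2]
posteriors, posterior_absent = posterior_probabilities(lrs, priors, 0.2)
```

Since the denominator is common to all database members, sorting by posterior probability is the same as sorting by $r_i P(R = i)$, and by $r_i$ alone when the prior is uniform.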



The proof of (2.2) is similar; we only sketch the difference with the proof of (2.1).
The probability in question is equal to

$$\frac{P(\mathrm{LR}_D = r \mid R = i)\, P(R = i)}{P(\mathrm{LR}_D = r,\, R \in \mathcal{D})},$$

which can be expanded as

$$\frac{P(\mathrm{LR}_D = r \mid R = i)\, P(R = i)}{\sum_{k=1}^{N} P(\mathrm{LR}_D = r \mid R = k)\, P(R = k)}.$$

From this point on, the proof proceeds as above.